Abstract. Large-scale data resulting from users' online interactions provide the ultimate
source of information to study emergent social phenomena on the Web. From individual
actions of users to observable collective behaviors, different mechanisms involving emotions
expressed in the posted text play a role. Here we combine approaches of statistical physics
with machine-learning methods of text analysis to study the emergence of emotional behavior
among Web users. Mapping the high-resolution data from digg.com onto a bipartite network
of users and their comments on posted stories, we identify user communities centered
around certain popular posts and determine the emotional content of the related comments with
an emotion classifier developed for this type of text. Applied over different time periods,
this framework reveals strong correlations between the excess of negative emotions and
the evolution of communities. We observe avalanches of emotional comments exhibiting
significant self-organized critical behavior and temporal correlations. To explore the robustness
of these critical states, we design a network automaton model on realistic network connections
with several control parameters, which can be inferred from the dataset. Dissemination of
emotions by a small fraction of very active users appears to critically tune the collective states.

1. Introduction

Online social interactions among users of different Web portals, which are mediated by the
posted material (text on blogs, pictures, movies, etc.) or occur via direct exchange of messages
on friendship networks, represent a prominent form of human communication. It has been
recognized recently [1, 2] that unsupervised online interactions, involving an ever larger
number of users, may lead through self-organized dynamics to new social phenomena
on the Web. Understanding the emergent collective behavior of users thus appears as one of
the central topics of the contemporary science of the Web, besides the structure and security
issues [3].

The role of emotions, known from conventional social contacts, is increasingly perceived
in Web-based communication. The empirical analysis of sentiment and mood, and
opinion mining through user-generated textual data, are currently developing research fields
[4, 5, 6]. Different dimensions of the emotional state of an individual user, i.e., arousal,
valence and dominance, can be measured in the laboratory [6]. The amount of emotion expressed
in a written text and transferred from/to a user has been studied [7, 6]. On the other hand,
methods have been devised to measure group emotional states and public mood, e.g., related to
a given event [8, 9]. However, the emergence of collective emotional states from the
actions of individual users over time is a nonlinear dynamical process that is not yet well
understood.

Physics and computer science research on the Web have been independently developing
their own methodologies and goals. For instance, large efforts in computer science are devoted
to improving the algorithms that retrieve information and sentiment from written text [10, 7].
Different methods have been developed in the social sciences to analyze particular phenomena
[11, 12, 13], whereas physics research is chiefly focused on the underlying processes from
the perspective of complex dynamical systems [14, 15, 16]. The quantitative approaches are
based on network representations and the application of graph theory (for a recent review
see [17]).

Here we use the theory of complex networks and the methods of statistical physics
of self-organized dynamical phenomena, which we combine with recent developments in
computer science focused on the emotional content of text, and study the emergence
of collective emotional states among Web users. This combined research framework offers
new insight into the genesis and structure of the collective emotional states. At the same
time it introduces a set of quantitative measures and parameters which characterize users'
behavior and can be inferred from the information embedded in the original data. For further
understanding of the observed collective states, we design a network-automaton model on the
realistic network structure, within which we tune the control parameters of the dynamics and
monitor their effects.

The organization of the paper is as follows. In section 2 we explain our methodology
and the structure of the collected data needed for quantitative analysis of this type. In section
3 we define the quantities necessary to characterize the collective emotional states of users
and perform a systematic analysis of the data to determine these quantities. In section 4 the
network-automaton model is introduced and its parameters are estimated from the empirical data;
the results of simulations are then presented. Finally, a short summary and conclusions are given
in section 5. Some related technical details are given in the appendix.

2. Data Structure and Methodology 

This type of analysis requires data with a fine structure. Specifically, we consider a
large dataset collected from digg.com, described in the Appendix. Typically, a user posts a
story by providing a link to other media and offering a short description. All users may then
read the story as well as the existing comments, and post their own comments, digg (approve),
or bury (disapprove) the story. Each user has a unique ID. Every action of a user is registered
with high temporal resolution and clearly attributed to the post (main story) and/or to a given
comment on that story. Our data also contain the full text of all comments.

Having developed an emotion classifier, which is based on machine-learning methods [18] and
trained on a large dataset of blog texts (see more details in the Appendix), we determine the
emotional content of each comment in our dataset. In particular, a probability is determined
for each text to be classified first as either subjective, i.e., having emotional content, or
otherwise objective. The subjective texts are then further classified as containing either
negative or positive emotional content. Owing to the high resolution of our data, we are
able to study quantitatively the temporal evolution of connected events and determine how
the emotions expressed in users' comments affect it. Here we are particularly interested in
the collective dynamical effects that emerge through the actions of individual users. For this
purpose we select a subset consisting of popular posts with a large number of comments, on
which we find over 50% comment-on-comment actions. The dataset, termed discussion-driven
Diggs (ddDiggs), consists of $N_P = 3984$ stories on which $N_C = 917708$ comments are
written by $N_U = 82201$ users.

Mapping the data onto a bipartite network is the first step in our methodology. The two
partitions are user nodes and post-and-comment nodes, respectively. The resulting
network thus has $N = N_P + N_C + N_U$ nodes. By definition, a link may occur only
between nodes of different partitions; these bipartite networks thus accurately represent
the post-mediated interactions between Web users. (Note that the post-mediated
communication makes the networks of blog users essentially different from the familiar
social networks on the Web, such as MySpace or Facebook, where users interact directly
with each other.) We keep information about the direction of the actions: a link
$i_P \to j_U$ indicates that user $j$ reads post $i$, while $j_U \to k_C$ indicates that user $j$ writes
comment $k$. In the data each comment has an ID that clearly attributes it to a given
post (original story). The emotional content $e \in \{0, -1, +1\}$ of the text appears as a property
of each post-and-comment node. Fig. 1 (left) shows an example of the data mapping
onto a directed bipartite network: it represents the network of one popular post from our dataset.
Note two types of well connected nodes: the main post (white square visible in the lower
part), and a user-hub (circle node visible in the upper left part of the network), which indicates a
very active user on that post.

Mapping the entire dataset results in a very large network. For different purposes,
however, the network size can be suitably reduced. For instance, a monopartite projection
onto the user partition can be made, using the number of common posts per pair of users as the
link weight [16]. For the purpose of this work, we keep the bipartite representation but compress
the network to obtain a weighted bipartite network of size $N = N_P + N_U$, consisting of
all popular posts and the users attached to them. The weight $W_{ij}$ of a link is then given by the
number of comments of user $i$ on post $j$. A part of such a network from our ddDiggs data
is shown in Fig. 1 (right). These networks exhibit very rich topology and interesting mixing
patterns [19, 16] (see also [20, 21] for similar networks constructed from the data of music
and movie users).
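
The compression step can be illustrated with a minimal Python sketch. The record layout `(user, post)` and the function names `weighted_bipartite`, `strengths` and `user_projection` are assumptions made for illustration only; they do not reflect the actual format of the Digg data or the code used in this study.

```python
from collections import defaultdict
from itertools import combinations


def weighted_bipartite(comments):
    """Return W[(user, post)] = number of comments of that user on that post.

    comments: iterable of (user, post) pairs, one pair per comment (assumed layout).
    """
    W = defaultdict(int)
    for user, post in comments:
        W[(user, post)] += 1
    return dict(W)


def strengths(W):
    """Node strength = sum of the weights of all links attached to the node."""
    user_strength, post_strength = defaultdict(int), defaultdict(int)
    for (user, post), weight in W.items():
        user_strength[user] += weight
        post_strength[post] += weight
    return dict(user_strength), dict(post_strength)


def user_projection(W):
    """Monopartite projection: number of common posts per pair of users."""
    users_of_post = defaultdict(set)
    for user, post in W:
        users_of_post[post].add(user)
    common = defaultdict(int)
    for users in users_of_post.values():
        for a, b in combinations(sorted(users), 2):
            common[(a, b)] += 1
    return dict(common)
```

The projection loops over all user pairs of each post, which is quadratic in the number of commenters per post; for very popular posts a sparse-matrix product would likely be preferable.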

Based on the network representation and the information about the emotional content of the
texts and the action times, we perform a quantitative analysis of the data. Specifically, we
determine:

• Community structure on the weighted bipartite networks, where a community consists of
users and certain posts which connect them. The emotional content of the comments
by users in these communities is analyzed;

• Temporal patterns of user actions for each individual user and for the detected user
communities. Correlations between the evolution of a community and the emotional
content of the comments are monitored over time;

• Avalanches of (emotional) comments, defined as sequences of comments of a given
emotion which are mutually connected over the network and fall within a small time bin
$t_{bin} = 5$ minutes.

3. Empirical Data Analysis 

3.1. User Communities and Emotions 

The time sequence of user activity on all posts, i.e., the number of comments within a small
time bin $t_{bin} = 5$ minutes, is followed over the entire time period available in the dataset.
Similarly, the time series of the number of emotional comments, $n_e(t)$, and of the number of
comments with negative/positive emotion, $n_\pm(t)$, are determined. An example is shown in
Fig. 2 (top): a zoom of the initial part of the time series is shown, indicating bursts (avalanches) in
the number of comments (further analysis of the avalanches is given below in Figs. 4 and 3f).
The occurrence of increased activity over a long period of time suggests the possible formation of
a user community around some posts. In this example, the intensive activity with avalanches
of comments lasted over 2153 hours, followed by reduced activity with sporadic events for
another 1076 hours.

Figure 1: (left) One-post bipartite network of users (bullets) and comments (squares) marked by
the emotional content: red for positive, black for negative, white for neutral. (right) Part of a
weighted bipartite network with users (bullets) and posts (squares). The widths of links are given by
the number of comments of the user on the post. The color of a post node indicates the overall
emotional content of all comments on that post.



Such communities can be accurately identified on the underlying network by different
methods [22, 23, 24]. Here we use the method based on the eigenvalue spectral analysis of
the Laplacian operator [26, 25], which is related to the symmetric weighted network $\{W_{ij}\}$
as

$L_{ij} = \delta_{ij} - \frac{W_{ij}}{\sqrt{\ell_i \ell_j}}$ ,    (1)

where $\ell_i$ represents the strength of node $i$. As described in detail in [25], the existence of
communities in a network is visualized by the branched structure in the scatter plot of the
eigenvectors corresponding to the lowest nonzero eigenvalues of the Laplacian. The situation
shown in Fig. 3a is for the user-projected network of our dataset of popular diggs. In this case
$W_{ij}$ represents the number of common posts per pair of users, and only users with strengths
$\ell_i > 100$ are kept. Branches indicating three large communities are visible. By identifying the
indices of the nodes within a branch, we obtain the list of users belonging to the community that the
branch represents. In order to unravel which posts (and comments) keep a given community together,
we perform the spectral analysis of the weighted bipartite network described above, where
the matrix elements $W_{ij}$ represent the number of comments of user $i$ on post $j$. In this way
we identify the list of user and post nodes belonging to a community. For each identified
community we then select from the original data all comments by the users in that community,
together with their time of appearance and emotional content.
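
To make the spectral step concrete, the sketch below computes the eigenvector belonging to the lowest nonzero eigenvalue of a normalized Laplacian of the form $L_{ij} = \delta_{ij} - W_{ij}/\sqrt{\ell_i \ell_j}$ for a small symmetric weight matrix. It is only an illustration under stated assumptions: it uses plain power iteration with deflation instead of a full eigensolver, it assumes a connected network with positive strengths, and the function names are hypothetical. The analysis in the paper inspects scatter plots of several eigenvectors; here only the sign structure of one eigenvector is used to separate two groups.

```python
import math


def normalized_laplacian(W):
    """L_ij = delta_ij - W_ij / sqrt(l_i * l_j), with strength l_i = sum_j W_ij."""
    n = len(W)
    strength = [sum(row) for row in W]
    L = [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(strength[i] * strength[j])
          for j in range(n)] for i in range(n)]
    return L, strength


def lowest_nonzero_eigenvector(W, iterations=500):
    """Power iteration on M = 2I - L after projecting out the zero mode.

    The eigenvalues of L lie in [0, 2], so the largest eigenvalue of M
    (once the zero mode of L is removed) belongs to the lowest nonzero
    eigenvalue of L. Assumes a connected network.
    """
    L, strength = normalized_laplacian(W)
    n = len(W)
    norm0 = math.sqrt(sum(strength))
    v0 = [math.sqrt(s) / norm0 for s in strength]   # eigenvector of eigenvalue 0
    v = [float(i + 1) for i in range(n)]            # deterministic start vector
    for _ in range(iterations):
        overlap = sum(a * b for a, b in zip(v, v0))
        v = [a - overlap * b for a, b in zip(v, v0)]  # deflate the zero mode
        v = [2.0 * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v
```

For two tightly linked groups joined by a single weak link, the components of the returned vector have opposite signs on the two groups, which is the simplest instance of the branched structure described above.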

3.2. Temporal Patterns of User Behavior 

The actual evolution of an identified community can be retrieved from the data related to it. 
The procedure — network mapping, community structure finding, and identifying the active 

Figure 2: (top) Time series of the emotional comments (pale) and of the comments classified
as carrying negative emotions (black) in the data of popular digg stories. (bottom) Temporal
fluctuations of the size, the number of comments, and the charge of the emotional comments
of a user community on ddDiggs.



Fluctuations in the number of different users (size) in one of the large communities
identified in our ddDiggs dataset are shown in Fig. 2 (bottom) with a time interval of one
day. Also shown are the fluctuations in the number of all comments and the "charge" of
emotional comments, $Q(t) = n_+(t) - n_-(t)$, where $n_+(t)$ and $n_-(t)$ stand for the number of
comments of all users in that community on a given day that are classified as positive and
negative, respectively. Remarkably, the increase in the number of users is closely
correlated with the excess of negative comments (critique) on the posts. In the supporting
material [27] we give several snapshots of the networks, indicating the evolution stages of this
particular community.
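
The binned time series and the charge $Q(t)$ can be computed along the following lines. This is a minimal sketch: the input layout `(timestamp_in_seconds, emotion)` with emotion in {-1, 0, +1} and the function names are assumptions for illustration, and the bin width is a free parameter (300 s for the 5-minute series, 86400 s for the daily charge).

```python
from collections import Counter


def emotional_time_series(comments, bin_seconds=300):
    """Count comments per time bin.

    comments: iterable of (timestamp_in_seconds, emotion), emotion in {-1, 0, +1}.
    Returns four Counters keyed by bin index:
    all comments, emotional comments, positive comments, negative comments.
    """
    n_all, n_e, n_pos, n_neg = Counter(), Counter(), Counter(), Counter()
    for t, emotion in comments:
        b = int(t // bin_seconds)
        n_all[b] += 1
        if emotion != 0:
            n_e[b] += 1
        if emotion > 0:
            n_pos[b] += 1
        elif emotion < 0:
            n_neg[b] += 1
    return n_all, n_e, n_pos, n_neg


def charge(n_pos, n_neg):
    """Q(t) = n_+(t) - n_-(t) for every bin that contains an emotional comment."""
    return {b: n_pos[b] - n_neg[b] for b in sorted(set(n_pos) | set(n_neg))}
```

A negative value of the charge in a bin signals an excess of negative comments in that bin.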

Users' activities on posted text exhibit robust features, which can be characterized by
several quantities shown in Figs. 3(b,c,d,e). Specifically, the pattern of a user's activity represents
a fractal set along the time axis, with the intervals $\Delta t$ between two consecutive actions
obeying a power-law distribution. In Fig. 3b the histogram $P(\Delta t)$ averaged over all users
in ddDiggs is shown. Consequently, the number of comments of a user within a given
time bin varies, as shown graphically in the color plot in Fig. 3d. The color code
represents the charge of the emotional comments made by a given user within a 12-hour time
bin. Different users are marked by indices along the y-axis and ordered by the time of their first
appearance in the dataset. The occurrence of diagonal stripes indicates activity that involves
new users potentially related to the same story. (Note that mutually connected comments
are accurately determined using the network representation, as discussed below.) Another
power-law dependence is found in the delay time $t_i - t_0$ of comments made by any of
the users to a given post, measured relative to its posting time $t_0$ [16, 28]. In view of the
emotional comments on a given post, in the present study it is interesting to consider the

Figure 3: (a) Scatter plot of the eigenvectors corresponding to the three lowest nonzero
eigenvalues of the Laplacian in Eq. (1), indicating the occurrence of user communities on popular
diggs. (b) Distribution of the time delay $\Delta t$ between two consecutive user actions, averaged over
all users on ddDiggs. Inset: Distribution of the strengths $\ell_U$, $\ell_P$ of user and post nodes on the
weighted bipartite network, part of which is shown in Fig. 1 (right). (c) Power spectrum of
the time series in Fig. 2 (top), with all emotional comments (pale) and comments classified
as negative (black). (d) Activity pattern of a user community; color indicates the charge of
all comments by a user within a time bin of 12 h. (e) Distribution of the delay between the
emotional actions of users on a given post, averaged over all posts. Inset: Distribution of the
probability $\alpha$ for a user to post a negative comment. (f) Distribution of the size of avalanches
of negative comments (black) and of all emotional comments (pale), obtained from the same dataset
as (c). [Log-binning with the base $b = 1.1$ is used for the distributions in panels (b), (c), (e), (f).]



delay $\delta t = t_{i+1} - t_i$ between two emotional comments. The histograms for the cases of negative
and positive comments from our dataset are given in Fig. 3e, averaged over all posts in the dataset.
Both distributions have a power-law tail with the slope $\approx 1.5$ for delay times in the range
$\delta t \in [24\,\mathrm{h}, 8\,\mathrm{wk}]$ and a smaller slope $\approx 1.25$ for $\delta t \in [5\,\mathrm{min}, 24\,\mathrm{h}]$, indicated by dashed lines.
However, the two distributions differ in the domain $\delta t < 5$ min, where negative comments
occur with larger frequency.
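
Delay distributions of this kind are usually estimated with logarithmic binning, which the caption of Fig. 3 states was done with base $b = 1.1$. The sketch below shows one common way to do this. The normalization (a probability density per bin) and the use of the geometric bin center are assumptions, since the text does not specify them, and the function names are illustrative.

```python
import math
from collections import defaultdict


def inter_event_times(timestamps):
    """Delays between consecutive events of one user or one post."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def log_binned_histogram(values, base=1.1):
    """Histogram with bin edges base**k, normalized as a probability density.

    Returns a list of (geometric bin center, density) for non-empty bins.
    Non-positive values are skipped because they have no logarithm.
    """
    counts = defaultdict(int)
    for x in values:
        if x > 0:
            counts[math.floor(math.log(x) / math.log(base))] += 1
    total = sum(counts.values())
    histogram = []
    for k in sorted(counts):
        low, high = base ** k, base ** (k + 1)
        histogram.append((math.sqrt(low * high), counts[k] / (total * (high - low))))
    return histogram
```

On a double-logarithmic plot of the returned pairs, a power-law tail appears as a straight line whose slope gives the exponent.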

Further analysis of the time series reveals long-range correlations in the number of
emotional comments. In particular, a power spectrum of the type $S(\nu) \sim 1/\nu$ is found,
both for the number of all emotional comments and for the number of comments with negative
emotion in the time series from Fig. 2 (top). The power-spectrum plots are shown in Fig. 3c.
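
For readers who wish to reproduce such a spectrum, a minimal sketch of a periodogram estimate is given below. It uses a direct discrete Fourier transform in pure Python, which costs O(n^2) operations and is suitable only for short series; a fast Fourier transform would be the practical choice for the full dataset. The normalization $|X_k|^2/n$ and the function name are assumptions.

```python
import cmath
import math


def power_spectrum(series):
    """Periodogram S(nu_k) = |X_k|^2 / n of the mean-subtracted series.

    Returns a list of (frequency in cycles per bin, power) for k = 1 .. n // 2.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        amplitude = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centered))
        spectrum.append((k / n, abs(amplitude) ** 2 / n))
    return spectrum
```

A $1/\nu$ dependence would show up as a slope close to $-1$ when the logarithm of the power is plotted against the logarithm of the frequency.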

3.3. Structure of the Emergent Critical States 

The observed correlations in the time series are indicative of bursting events, which are
familiar in self-organized dynamical systems. In our case an avalanche represents a sequence
of comments, i.e., a comment triggering more comments within a small time bin $t_{bin}$, and
so on, until the activity eventually stops. In analogy to complex systems such as earthquakes
[29] or Barkhausen noise [31], the avalanches can be readily determined from the measured
time series, like the one shown in Fig. 2 (top). Specifically, putting a baseline at the level of
random noise, an avalanche encloses the connected portion of the signal above the baseline.
Thus the size of an avalanche in our case is given by the number of comments enclosed
between two consecutive intersections of the corresponding signal with the baseline. The
distribution of sizes of such avalanches is shown in Fig. 3f, determined from the signal of
emotional comments from Fig. 2 (top). A power law with the slope $\tau_s \approx 1.5$ is found over two
decades.
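
The baseline procedure can be sketched as follows. The choice to count all comments in the bins above the baseline (rather than only the excess over the baseline) is an assumption, as is the function name; the text does not fix this detail.

```python
def avalanches(signal, baseline=0):
    """Split a binned signal into avalanches.

    An avalanche is a maximal run of consecutive bins with signal > baseline.
    Returns (sizes, durations): size = number of comments in the run,
    duration = number of bins in the run.
    """
    sizes, durations = [], []
    size = duration = 0
    for count in signal:
        if count > baseline:
            size += count
            duration += 1
        elif duration > 0:
            # the signal returned to the baseline: close the current avalanche
            sizes.append(size)
            durations.append(duration)
            size = duration = 0
    if duration > 0:
        # close an avalanche that is still running at the end of the series
        sizes.append(size)
        durations.append(duration)
    return sizes, durations
```

For example, `avalanches([0, 2, 3, 0, 0, 1, 0, 4])` yields the sizes `[5, 1, 4]` and the durations `[2, 1, 1]`.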

The scale invariance of avalanches is a signature of self-organized critical (SOC) states
[33, 32] in dynamical systems. Typically, a power-law distribution of the avalanche sizes

$P(s) \sim s^{-\tau_s} \exp(-s/s_0)$    (2)

and of other quantities pertinent to the dynamics [34, 38, 31] can be measured before a natural
cut-off $s_0$, depending on the system size. The related measures, for instance the distribution
of the temporal distance between consecutive avalanches, $P(\delta T)$, also exhibit a power-law
dependence, as found in earthquake dynamics [29, 30].

Figure 4: Cumulative distributions of the avalanche size, $P(s)$, and of the inactivity time, $P(\delta T)$,
between avalanches of comments observed at individual posts on ddDiggs. The cases with comments
of positive and negative emotional content are shown separately.



Here we give evidence that SOC states may occur in the events at individual posts
in our dataset. The cumulative distributions of the avalanche sizes, $P(s)$,
averaged over all 3984 posts, are shown in Fig. 4. The distributions for avalanches of different
emotional content are fitted with Eq. (2) with different exponents ($\tau_s \in [1.0, 1.2]$) and
cut-offs. On the single-post networks we can also identify the quiescence times between
consecutive avalanches; their distributions, $P(\delta T)$, are also shown in Fig. 4. For comparison,
the differential distribution of the avalanche sizes in Fig. 3f, which refers to the simultaneous
activity on all posts, shows an excessive number of very large avalanches (supercriticality).
The occurrence of different attractors inherent to the dynamics [35, 36] or the coalescence of
simultaneously driven events [37, 38] may result in nonuniversal scaling exponents, which
depend on a parameter. The relevance of conservation laws is still an open question [39].
The situation is even more complex for dynamics on networks. Nevertheless, SOC
states have been identified in different processes on networks [40, 41, 42]. In order to
understand the origin of the critical states in the empirical data, and their dependence on
user behavior, in the following we design a cellular-automaton type model on the weighted
post-user network, within which we identify the realistic parameters governing the dynamics
and vary them.

4. Modeling Avalanche Dynamics on Popular Posts 

The microscopic dynamics on blogs, i.e., a user posting a comment, which triggers more users
to act, etc., can be formulated in terms of update rules and constraints, which affect the
course of the process and thus the emergent global states. A minimal set of control parameters
governing the dynamics with emotional comments is described below and extracted from
our empirical data of ddDiggs. Specifically:

• User delay $\Delta t$ in responding to posted material, extracted from the data, is given by the
power-law tailed distribution $P(\Delta t)$ in Fig. 3b, with the slope $\tau_\Delta \approx 1$ above the threshold
time $\Delta_0 \approx 300$ min;

• User tendency to post a negative comment, measured by the probability $\alpha$, inferred
from the data as the fraction of negative comments among all comments by a given user.
Averaged over all users in the dataset, the distribution $P(\alpha)$ is given in the inset to Fig. 3e.

• Post strength $\ell_{iP}$ is a topological measure uniquely defined on our weighted bipartite
networks as the sum of all weights of its links, i.e., the number of users linked to it, counted
with the multiplicity of their comments. Thus it is a measure of attractiveness (relevance) of the
posted material. Histograms of the strengths of posts and users in our dataset are given
in the inset to Fig. 3b.

• User dissemination probability $\lambda$ is a measure of contingency of bloggers' activity. It is
deduced from the empirical data as the average fraction of the users who are active more
than once/at different posts within a small time bin ($t_{bin} = 5$ min). In the model we vary
this parameter, as explained below.

• Network structures mapped from the real data at various instances of time underlie
the evolution of connected events. Here we use the weighted bipartite network of the
ddDiggs data, Fig. 1 (right).
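
As an illustration of how two of the parameters listed above might be estimated from comment records, consider the sketch below. The record layout `(user, post, timestamp_in_seconds, emotion)` and both function names are hypothetical. In particular, averaging the fraction of repeatedly active users over the occupied time bins is only one possible reading of the definition of $\lambda$, and it ignores the "at different posts" variant.

```python
from collections import defaultdict


def user_negativity(comments):
    """alpha per user: fraction of negative comments among all comments of that user.

    comments: iterable of (user, post, timestamp_in_seconds, emotion),
    with emotion in {-1, 0, +1} (assumed layout).
    """
    total, negative = defaultdict(int), defaultdict(int)
    for user, _post, _t, emotion in comments:
        total[user] += 1
        if emotion == -1:
            negative[user] += 1
    return {user: negative[user] / total[user] for user in total}


def dissemination_fraction(comments, bin_seconds=300):
    """Average, over occupied time bins, of the fraction of active users
    who posted more than one comment within the bin (assumed reading of lambda)."""
    per_bin = defaultdict(lambda: defaultdict(int))
    for user, _post, t, _emotion in comments:
        per_bin[int(t // bin_seconds)][user] += 1
    if not per_bin:
        return 0.0
    fractions = [sum(1 for c in users.values() if c > 1) / len(users)
                 for users in per_bin.values()]
    return sum(fractions) / len(fractions)
```

The histogram of the values returned by `user_negativity` corresponds to an empirical estimate of $P(\alpha)$.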

Within the network automaton model these parameters are implemented as follows. First,
the weighted bipartite network is constructed from the selected data and the given time interval.
To each post on that network we associate its actual strength $\ell_{iP}$, and to each user a (quenched)
probability $\alpha$ taken from the actual distribution $P(\alpha)$. A well connected user is selected to
start the dynamics by posting a comment on one of its linked posts. The lists of active users
and exposed posts are initialized.

Then at each time step all users linked along the network to the currently exposed posts
are prompted for action. A prompted user takes its delay time $\Delta t$ from the distribution $P(\Delta t)$
of the actual dataset. Only the users who got $\Delta t \le t_{bin}$ are considered active within this
time step and may comment on one of the exposed posts along their network links. The posted
comment is considered negative with the probability $\alpha$ associated with that user; otherwise
it is positive or objective with equal probability. With the probability $\lambda$ each
active user may make an additional comment on any one of its linked (including unexposed)
posts. The post strength is reduced by one with each received comment. Commented posts are
added to the list of currently exposed posts. In the next time step the activity starts again from
the updated list of exposed posts, and so on. Note that the activity can stop when: (a) no user
is active, i.e., due to a long delay time $\Delta t > t_{bin}$; (b) the strength of the targeted post is exhausted;
(c) no network links occur between currently active areas. In the simulations presented here
we vary the parameter $\lambda$ while the rest of the parameters are kept at their values inferred from
the considered dataset, as described above. The resulting avalanches of all comments and
of the positive/negative comments are identified. The distributions of the avalanche size and
duration are shown in Fig. 5 for different values of the dissemination parameter $\lambda$.
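
The update rules described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the simulation code used for the results: the network is passed as plain dictionaries, the "well connected" starting user is taken as the one with the most linked posts, the delay distribution is supplied as a user-defined sampler `sample_delay`, and all names are hypothetical.

```python
import random


def simulate_avalanche(links, strength, alpha, lam, sample_delay,
                       t_bin=5.0, max_steps=10_000, seed=0):
    """Run one avalanche of the network automaton.

    links:        dict user -> list of posts linked to that user
    strength:     dict post -> number of comments the post can still receive
    alpha:        dict user -> probability that a comment is negative
    lam:          dissemination probability (extra comment on any linked post)
    sample_delay: callable(rng) -> delay, in the same units as t_bin
    Returns a list of time steps, each a list of emotions (-1, 0, +1).
    """
    rng = random.Random(seed)
    remaining = dict(strength)
    readers = {}
    for user, posts in links.items():
        for post in posts:
            readers.setdefault(post, set()).add(user)

    def comment(user, post, emitted, newly_exposed):
        if remaining.get(post, 0) <= 0:
            return                      # strength of the targeted post is exhausted
        remaining[post] -= 1
        negative = rng.random() < alpha[user]
        emitted.append(-1 if negative else rng.choice((0, +1)))
        newly_exposed.add(post)

    start = max(links, key=lambda u: len(links[u]))   # assumed "well connected" user
    first, exposed = [], set()
    comment(start, rng.choice(links[start]), first, exposed)
    history = [first]
    for _ in range(max_steps):
        emitted, newly_exposed = [], set()
        prompted = sorted({u for p in exposed for u in readers.get(p, ())})
        for user in prompted:
            if sample_delay(rng) > t_bin:
                continue                # user not active within this time step
            targets = [p for p in links[user]
                       if p in exposed and remaining.get(p, 0) > 0]
            if targets:
                comment(user, rng.choice(targets), emitted, newly_exposed)
            if rng.random() < lam:      # dissemination to any linked post
                comment(user, rng.choice(links[user]), emitted, newly_exposed)
        if not emitted:
            break                       # the avalanche has stopped
        exposed |= newly_exposed
        history.append(emitted)
    return history
```

The avalanche size is then the total number of comments in `history` and its duration is the number of time steps. A sampler such as `lambda rng: rng.paretovariate(1.0)` could stand in for a power-law tailed delay, although the empirical $P(\Delta t)$ would be the appropriate input.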

Figure 5: Cumulative distributions of the size (a) and duration (b) of avalanches of emotional
comments simulated within the network automaton model for varied dissemination parameter
$\lambda$. Fixed parameters for the network structure, post strengths, and user inclination towards
negative comments and delayed actions are determined from the ddDiggs dataset, as explained
above. Inset: The distribution $P(s)$ for all avalanches, and for avalanches of positive and
negative comments, for the critical value of the dissemination parameter $\lambda = \lambda_c$.

The simulation results, averaged over several initial points, show that the power-law
distributions (2) of the avalanche size occur for the critical value of the dissemination,
$\lambda = \lambda_c \approx 2.1 \times 10^{-4}$ for this particular dataset. Varying the parameter $\lambda$ in the
simulations appears to have major effects on the bursting process. Specifically, the power law
becomes dominated by the cut-off for $\lambda < \lambda_c$, indicating subcritical behavior. Conversely,
when $\lambda > \lambda_c$, we observe an excess of large avalanches, compatible with supercriticality. The
critical behavior at $\lambda = \lambda_c$ has been confirmed by several other measures. The slopes of
the distributions of size and duration, shown in Fig. 5, are $\tau_s - 1 \approx 1.5$ and $\tau_T \approx 1.33$,
respectively. The critical behavior persists, but with changed scaling exponents, when the
other control parameters are varied. In particular, assuming the distribution of user delay as
$P(\Delta t) \sim \exp(-\Delta t/T_0)$ leads to power-law avalanches with the slopes $\tau_s$ and $\tau_T$ depending
on the parameter $T_0$. The results are shown in the supporting material [27].

5. Conclusions 

We have analyzed a large dataset of discussion-driven comments on stories from
digg.com as a complex dynamical system with emergent collective behavior of users. With the
appropriate network mappings and using the methods and theoretical concepts of statistical
physics combined with computer science methods for text analysis, we have performed
a quantitative study of the empirical data to:

• Demonstrate how social communities emerge with users interlinked via their
comments over some popular stories;

• Reveal that an important part of the driving mechanisms is rooted in the emotional actions
of the users, which are dominated by negative emotions (critiques);

• Show that the bursting events with users' emotional comments exhibit significant
self-organization with critical states.

Properties of the emergent collective states can be captured within a network-automaton
model, in which the real network structure and the parameters native to the studied dataset are
used. Despite several open theoretical problems related to self-organized criticality
on networks, the observed critical states appear to be quite robust when the parameters of
users' behavior are varied within the model. However, they are prone to overreaction, with
supercritical emotional avalanches triggered by a small fraction of very active users, who
disseminate activity (and emotions) over different posts. Within our approach, the activities
and related emotions of every user and of the identified user communities are traced in time
and over the emerging network of their connections. From the viewpoint of complex dynamical
systems, the statistical indicators of the collective states and the numerical values of the
parameters governing the dynamics of cybercommunities are readily extracted from the
empirical data.

Appendix A.

Data Collection: The Digg data set was collected through the website's publicly available
API, which allows programmers to directly access the data stored at its servers, such
as stories, comments, user profiles, etc. The data set comprises a complete crawl
spanning the months of February, March and April 2009. It contains 1,195,808 stories, 1,646,153
individual comments and 877,841 active users. More information can be found in [43]. The
data set is freely available for scientific research.

Emotion Classifier: The emotion classifier is based on a supervised machine-learning
approach, according to which a general inductive process initially learns the characteristics
of a class during a training phase, by observing the properties of a number of pre-classified
documents, and applies the acquired knowledge to determine the best category for new,
unseen documents [10]. Specifically, it represents an implementation of the hierarchical
Language Model (h-LM) classifier [44, 45], according to which a comment is initially
classified as objective or subjective and, in the latter case, as positive or negative. The
h-LM classifier was trained on the BLOGS06 data set [46], which is an uncompressed 148 GB
crawl of approximately 100,000 blogs, a subset of which has been annotated by human
assessors regarding whether they contain factual information or positive/negative opinions
about specific entities, such as people, companies, films, etc. Because the resulting training
data set is uneven, the probability thresholds for both classification tasks were optimized on a
small subset of human-annotated Digg comments, in a fashion similar to [18].
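
The two-stage decision can be summarized by the following sketch. It only illustrates the hierarchical thresholding step; the probabilities are assumed to come from the trained language models, and the threshold values of 0.5 are placeholders, not the optimized values used in this work.

```python
def classify_emotion(p_subjective, p_negative,
                     subjectivity_threshold=0.5, negativity_threshold=0.5):
    """Hierarchical decision: objective vs. subjective, then negative vs. positive.

    p_subjective: estimated probability that the text is subjective
    p_negative:   estimated probability that a subjective text is negative
    Returns 0 (objective), -1 (negative) or +1 (positive).
    Both thresholds are placeholder values (assumptions).
    """
    if p_subjective < subjectivity_threshold:
        return 0
    return -1 if p_negative >= negativity_threshold else +1
```

Shifting either threshold away from 0.5 is one simple way to compensate for an uneven training set, which is the motivation given above for optimizing them on annotated Digg comments.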